Final Project: A Model to Estimate HR Scores (2020-)

Data Science 2 with R (STAT 301-2)

Author

Brian Key

Published

March 13, 2024

1 Introduction

1.1 Prediction Problem

Two of my three dissertation chapters rely on the use of a dependent variable measuring human rights respect. Hitherto, I have depended on Fariss’s Human Rights (HR) Scores,1 but he has not updated his dataset since 2020.2 This poses a problem insofar as it excludes the most recent country-year observations from my universe of cases entirely.3 In my previous final project, and by way of exploratory data analyses (EDAs), I identified a “raw” form of the physical violence index (PVI)—a simple average of the freedoms from political killings and torture indicators featured in the Varieties of Democracy (V-Dem) dataset—as a potential substitute for HR scores. However, for this project, I aim to predict outright the missing data in HR scores whilst more formally appraising the suitability of the raw PVI as a simple substitute for HR scores.

  • 1 For the original article wherein these scores were introduced, see Fariss, 2014.

  • 2 See his Dataverse page for evidence.

  • 3 That is, it is impossible to evaluate the relationship(s) between independent and dependent variables for cases where the latter is systematically missing.

  • 1.2 Key Findings

    After implementing and assessing a diversity of methods to predict HR scores, as summarized in this report, I render the following conclusions:

    1. There exist more accurate ways to approximate HR scores than to use the raw PVI in lieu of them (see Section 5).
    2. Of these, the best-performing is a KNN-regression model (see Section 6).
    3. Time-series models exhibit even lower mean prediction errors, yet evidence suggests they may feature greater variance vis-à-vis the KNN model (see Section 8).

    1.3 Overview of Report

    The remainder of my report is organized as follows:

    1. An overview of HR scores and V-Dem, including an EDA of the former (Section 2).
    2. A description of initial data preprocessing (Section 3).
    3. A summary of the recipes used (Section 4).
    4. The results of the baseline fits (Section 5).
    5. The results of the tuned fits (Section 6).
    6. The final fit’s predictions and an analysis thereof (Section 7).
    7. A discussion of time-series predictions and how they compare to those of the final fit (Section 8).
    8. Appendices (Section 9).

    2 Data Overview

    2.1 HR Scores

    The HR Scores dataset, which features the HR scores variable,4 consists of 11275 country-year observations spanning the period 1946-2019. The HR scores variable exhibits no missingness: every observation possesses an HR score.

  • 4 This variable is denominated theta_mean in the original dataset. For the remainder of the report, I use “HR scores” to refer to this specific variable, not the dataset from which it originates.

  • 2.1.1 EDA of HR Scores

    The distribution of HR scores may be visualized as follows:

    Figure 1: Distribution of HR Scores (Density & Box Plots)

    Figure 1 shows that the variable isn’t distributed in a perfectly symmetrical manner, exhibiting a slight right-skew and a degree of bimodality. Nevertheless, these features seem relatively minor, and most methods available to correct them do not apply given the nature of the variable.5 I also attempted a Yeo-Johnson transformation of HR scores, but it failed to adequately rectify the problems of skewness and bimodality.6 In view of these considerations, I proceed by keeping HR scores unaltered.
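The attempted transformation can be sketched with the recipes package as follows (the toy data here is a hypothetical stand-in for the actual HR Scores dataset):

```r
# A minimal sketch of the Yeo-Johnson attempt.
# The data frame is a hypothetical stand-in; only the step matters.
library(recipes)

hr <- data.frame(theta_mean = c(-3.1, -1.2, -0.4, 0.2, 0.9, 1.8, 4.5))

yj_baked <- recipe(~ theta_mean, data = hr) |>
  step_YeoJohnson(theta_mean) |>   # handles negative values, unlike Box-Cox
  prep() |>
  bake(new_data = NULL)
```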

  • 5 Namely, square-root, log, and Box-Cox transformations only work for positive variables, but approximately half of HR Scores’ values are negative.

  • 6 For more, see Figure 9.

  • 2.2 V-Dem

    The 2024 version of V-Dem (V14) consists of 4607 variables and 27734 country-year observations spanning the period 1789-2023. 12185 of these observations cover the years 1946-2019, whereas 716 of them cover 2020-2023. Overall, the dataset features an appreciable degree of missingness, with approximately 26.36% of values in the 1946-2019 period being absent.7 However, rates of missingness are vastly lower for the five “high-level” democracy indices (approximately 0.30%) and the twenty-one “mid-level” components of said indices (approximately 0.59%).8 These missingness rates are reduced even further following pre-recipe preprocessing (see Section 3).
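The missingness figures above can be computed along these lines (the tiny data frame and ID columns below are hypothetical stand-ins; the actual computation lives in scripts/1b_data_quality_check.R):

```r
# Sketch of the percent-missing computation over non-ID variables.
# `vdem` and `id_cols` are hypothetical stand-ins.
vdem <- data.frame(
  country_name  = c("A", "B", "C"),
  year          = c(2018, 2019, 2019),
  v2x_polyarchy = c(0.5, NA, 0.7),
  v2x_libdem    = c(NA, 0.4, 0.6)
)
id_cols <- c("country_name", "year")

# Share of missing cells among the non-ID columns, as a percentage
pct_missing <- mean(is.na(vdem[, setdiff(names(vdem), id_cols)])) * 100
```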

  • 7 This figure is calculated from all non-ID variables. See scripts/1b_data_quality_check.R for the computation.

  • 8 See scripts/1b_data_quality_check.R again for the relevant computations.

  • 3 Pre-Recipe Preprocessing

    Prior to writing my recipes, I complete the following noteworthy preprocessing steps:

    • Merge V-Dem and HR Scores.9
    • For V-Dem, recode Czechoslovakia’s pre-Velvet Divorce (1992) observations as those of the Czech Republic.10 Doing so rectifies widespread missingness in this case. I deemed this justifiable on two counts:
      1. Fariss assigns the Czech Republic COW code to Czechoslovakia.
      2. V-Dem denominates both Czechoslovakia and the Czech Republic “Czechia”; there is no overlap in naming conventions or otherwise with Slovakia, which doesn’t appear in the dataset until 1993.
    • Given the wholesale absence of predictors and thus the virtual impossibility of prediction-making, remove country-years that appear in HR scores but not in V-Dem. These cases appear in Table 1, below, where the value in the “N Missing” column is positive:
  • 9 My actual code also merges these datasets with the Political Terror Scale (PTS) dataset. At the outset, I desired to include variables from the latter in my models, as they were explicitly noted as inputs for the original HR Scores; however, they featured widespread missingness, to the extent that imputing their values proved to be computationally infeasible.

  • 10 I.e., recode cowcode == 315 (Czechoslovakia) to cowcode == 316 (Czech Republic).
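The recode itself amounts to a one-line dplyr operation, sketched here on hypothetical data:

```r
# Recode Czechoslovakia's COW code (315) to the Czech Republic's (316).
# The data frame is a hypothetical stand-in.
library(dplyr)

vdem <- data.frame(cowcode = c(315, 316, 2), year = c(1990, 1995, 1995))

vdem <- vdem |>
  mutate(cowcode = if_else(cowcode == 315, 316, cowcode))
```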

  • Table 1: N Missing from V-Dem by Country
    • These observations mainly involve:
      1. Post-Soviet states (e.g., Armenia, Azerbaijan, etc.) and post-WWII states (e.g., West and East Germany).
      2. European microstates (e.g., Liechtenstein and Monaco) and small island states (e.g., Samoa, St. Vincent and the Grenadines, etc.).
    • The former cases are more tolerable, for they stem from minor discrepancies in coding start/end dates and only pertain to a handful of years. The latter give more reason for pause; indeed, to delete these observations is effectively to remove the world’s smallest countries from the ambit of my research entirely. Nevertheless, it doesn’t make sense, in my estimation, to retain them when there exist no data in V-Dem whatsoever—including the most basic of figures, such as population, GDP, etc.—on which even somewhat-credible imputations might be based.
    • Select as predictors the variables I identified in my previous final project as being related in theory to HR scores, as well as the entirety of V-Dem’s high-level and mid-level indices. I do so primarily to minimize my computational workload, but also under the intuition that including variables beyond this set would yield diminishing returns with respect to accuracy.11
  • 11 That is, including more variables would bring fewer improvements to predictive accuracy at the expense of computational speed.

  • 12 My saved preprocessed dataset actually contains 52 variables, but three of these are PTS metrics that are removed at the beginning of recipe preprocessing.

  • 13 Compare with Figure 1. Admittedly, the distribution might be marginally more right-skewed; this likely owes itself to the removal of the European microstates, whose human rights levels are generally very high.

  • Upon considering and completing these steps, I move forward with a preprocessed dataset of 49 variables and 10524 observations.12 The plot below demonstrates that the preprocessing steps involving HR scores do not significantly alter the distribution thereof:13

    Figure 2: Distribution of HR Scores, Preprocessed (Density & Box Plots)

    The preprocessed dataset also exhibits an exceedingly low degree of missingness relative to that of V-Dem overall (see Section 2), with only about 0.19% of values from non-ID variables being missing. Figure 3, below, summarizes this missingness by predictor:

    Figure 3: Number of Missing Values by Predictor, Preprocessed Dataset

    Ultimately, in virtue of the minimal extent of missingness, imputations are unlikely to prove computationally demanding or to appreciably undermine the credibility of the predictions generated therefrom.

    3.1 Data Splitting Specifications

    Upon completion of these preprocessing steps, I split the data with proportions of 75-25 training/testing, 90-10 analysis/assessment. The cross-validation split is repeated five times, resulting in a total of 50 folds.
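This splitting scheme can be sketched with rsample (the stand-in data below is hypothetical):

```r
# 75/25 train/test split, then 10-fold CV (90/10 analysis/assessment)
# repeated 5 times, for 50 folds in total. Data is a hypothetical stand-in.
library(rsample)

set.seed(301)
hr_data <- data.frame(theta_mean = rnorm(200), x = rnorm(200))

hr_split <- initial_split(hr_data, prop = 0.75)
hr_train <- training(hr_split)
hr_folds <- vfold_cv(hr_train, v = 10, repeats = 5)  # 50 folds
```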

    4 Recipes

    4.1 Baselines

    My null (null) and basic baseline (lm) models (see Section 5) draw on three recipes, each distinguished by the number of neighbors set for KNN-imputation of missing values: 5, 10, and 20, respectively.14 I establish and test these recipe variants, at this stage, to appraise the extent to which changes in the number of neighbors set for imputation effectuate changes in model performance.

  • 14 I considered bagged-tree imputation as well, but it proved computationally infeasible.

  • Aside from KNN-imputation, the main preprocessing components shared by each recipe-group are as follows:

    • Transforming country ID (cowcode) and year to dummy variables.
    • Log-transforming population, GDP, and GDP-per-capita.15
    • Removing the year dummies subsequent to KNN-imputation.
    • Normalizing all numeric predictors.
  • 15 These variables are widely-known as right-skewed, and they are indeed so in V-Dem. For evidence that I have verified this, see scripts/1b_data_quality_check.R.

  • 16 Put differently, because year is a factor, there is no way to estimate the relationship between “2023,” “2024,” etc. and HR Scores when there exists wholesale missingness of HR scores for those years.

  • The penultimate of these steps is important. As a factor, year can continue to be used as an imputation predictor for V-Dem observations, because there will almost certainly be V-Dem data for future years; yet it cannot be used to predict HR Scores, for that variable ends in 2019.16

    In aggregate, the preprocessing steps result in a training set of 226 variables: 1 outcome, 2 ID variables,17 and 223 predictors—179 of which are country dummies.
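The shared preprocessing components can be sketched as a single recipes pipeline; the data frame and column names below are hypothetical stand-ins, with cowcode and year assumed to already be factors so that step_dummy() applies:

```r
library(recipes)

# Hypothetical stand-in data
hr_train <- data.frame(
  theta_mean   = rnorm(20),
  country_name = rep(c("A", "B"), times = 10),
  cow_year     = paste0("id_", 1:20),
  cowcode      = factor(rep(c("2", "20"), each = 10)),
  year         = factor(rep(2015:2019, times = 4)),
  e_pop        = seq(1e6, 1e8, length.out = 20),
  e_gdp        = seq(1e9, 1e12, length.out = 20),
  e_gdppc      = seq(1e3, 1e5, length.out = 20)
)

base_rec <- recipe(theta_mean ~ ., data = hr_train) |>
  update_role(country_name, cow_year, new_role = "id") |>
  step_dummy(cowcode, year) |>                         # country & year dummies
  step_log(e_pop, e_gdp, e_gdppc) |>                   # right-skewed predictors
  step_impute_knn(all_predictors(), neighbors = 5) |>  # KNN imputation
  step_rm(starts_with("year_")) |>                     # drop year dummies afterwards
  step_normalize(all_numeric_predictors())

baked <- bake(prep(base_rec), new_data = NULL)
```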

  • 17 country_name and cow_year

  • Also included as a baseline is a fourth recipe (identified in Section 5 as akt_lm), which simply avails itself of my unscaled physical violence index (PVI), created in my previous final project, as the sole predictor of HR scores. I do so in order to test the appropriateness of merely substituting it for HR scores.

    4.2 Tuning Models

    Additionally, I create recipes for six tuning models (see Section 6):

    1. Ridge
    2. Lasso
    3. Elastic net
    4. KNN
    5. Random forest
    6. Boosted tree

    These recipes can be further categorized as either “kitchen-sink”- or “feature-engineering”-based:

    4.2.1 Kitchen Sink

    The kitchen-sink recipes are coterminous with the null and basic-baseline recipes, leveraging the full set of available predictors whilst eschewing nonessential preprocessing. The kitchen-sink recipe for the KNN, random forest, and boosted tree models is nearly identical to that of the five-neighbor baselines, the only difference being the inclusion of one-hot encoding for the country-ID dummy variables.18

  • 18 This means that the total number of predictors utilized by the “tree” models is 224, not 223.

  • 4.2.2 Feature Engineering

    The feature-engineering recipes either expand on or curtail the universe of predictors available in the kitchen-sink recipes, above.

    The first simply includes interaction terms between GDP per capita and the four egalitarian indices, the logic being that greater GDP per capita is associated with greater respect for human rights conditional on the presence of high levels of equality.19 This recipe is only applied to the ridge, lasso, and elastic-net models, bringing the total number of predictors in these instances to 227.20

  • 19 Put differently, I theorize that the correlation between mean wealth and human-rights respect is lower when wealth and power are unequally distributed, as may be the case in many of the Gulf states.

  • 20 Interaction terms are generally unnecessary for tree-based or KNN models, hence why this recipe is not applied to them.

  • 21 I experimented with step_corr() as a means of dealing with multicollinearity, but doing so failed to remove any variables from my prepped and baked dataset.

  • 22 The egalitarian component index is one of these removed variables, hence why the interaction term involving it is likewise removed.

  • 23 See appendix for correlation graph.

  • The second retains most of these interaction terms and is again applied to the ridge, lasso, and elastic-net models, but is more multicollinearity averse,21 removing variables that simply average already-present variables (e.g., the unscaled PVI), are by nature aggregative (e.g., the high-level democracy indices), or are merely variants of already-present variables (e.g., the ordinal versions of the civil liberties index).22 Variables exhibiting low degrees of correlation with HR scores (e.g., the de jure suffrage index) are removed as well.23 In all, the number of predictors that this preprocessing yields is 207.

    The third takes the foregoing recipe and prepares it for the KNN, random forest, and boosted tree models, removing the interaction terms featured therein whilst also including one-hot encoding for the dummy variables. Accordingly, the number of predictors available in this recipe is 205.
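The two feature-engineering devices, interaction terms for the linear models and one-hot encoding for the KNN/tree models, can be sketched as follows (data and column names hypothetical):

```r
library(recipes)

# Hypothetical stand-in data
dat <- data.frame(
  theta_mean = rnorm(12),
  cowcode    = factor(rep(c("2", "20", "200"), times = 4)),
  e_gdppc    = seq(8, 11, length.out = 12),   # assumed already logged
  v2x_egal   = runif(12)
)

# Interaction between GDP per capita and an egalitarian index (linear models)
int_rec <- recipe(theta_mean ~ ., data = dat) |>
  step_dummy(cowcode) |>
  step_interact(~ e_gdppc:v2x_egal)

# One-hot country dummies (KNN / tree models)
tree_rec <- recipe(theta_mean ~ ., data = dat) |>
  step_dummy(cowcode, one_hot = TRUE)

baked_int  <- bake(prep(int_rec), new_data = NULL)
baked_tree <- bake(prep(tree_rec), new_data = NULL)
```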

    4.3 Summary

    To recap, the recipes I deploy and their apportionment to the models can be organized as follows:

    1. Kitchen Sink:
      • Main:
        • 5 neighbors: basic baseline, null
        • 10 neighbors: basic baseline, null
        • 20 neighbors: basic baseline, null
      • Tree:
        • 5 neighbors: KNN, random forest, boosted tree
    2. Feature Engineering:
      • Main:
        • 5 neighbors, interactions only: ridge, lasso, elastic net
        • 5 neighbors, multicollinearity averse: ridge, lasso, elastic net
      • Tree:
        • 5 neighbors, multicollinearity averse: KNN, random forest, boosted tree
    3. Unscaled PVI

    These amount to eight recipes in total: four “kitchen sink,” three “feature engineering,” and one additional baseline (unscaled PVI). Within the first of these recipe groups are nine models, within the second nine, and within the third one, bringing the total number of models assessed to 19.

    5 Baseline Fits

    Table 2, below, gives the performance of my baseline fits on the cross-validation folds as measured by mean RMSE:

    Table 2: Comparison of Mean RMSEs by Model
    Workflow ID Mean RMSE Std. Error
    neighbors_5_lm 0.5651164 0.0023461
    neighbors_10_lm 0.5653670 0.0023460
    neighbors_20_lm 0.5657787 0.0023566
    akt_lm 1.0097946 0.0042792
    neighbors_5_null 1.4652267 0.0045097
    neighbors_10_null 1.4652267 0.0045097
    neighbors_20_null 1.4652267 0.0045097

    As we can see, the null models perform far worse than the basic baseline models; the unscaled PVI baseline splits the difference between the two.

    There appear to be two important takeaways at this juncture. First, the basic baseline outperforms the unscaled PVI baseline, meaning we can proceed knowing that there exist better ways to estimate HR scores than to use the unscaled PVI as a simple substitute. Second, changes in the number of neighbors used for KNN-imputation seem to have little effect on model performance. This is welcome insofar as the choice of neighbors is no longer a concern: we proceed with the step_impute_knn() default of neighbors = 5 for all remaining workflows, though we could in theory have selected a larger (or smaller) number with only marginal impact on our findings.

    6 Tuned Fits

    The following hyperparameters are tuned on the cross-validation folds:

    1. Ridge: penalty
    2. Lasso: penalty
    3. Elastic net: penalty and mixture
    4. KNN: neighbors
    5. Random forest: mtry and min_n
    6. Boosted tree: mtry, min_n, and learn_rate

    For the creation of the respective tuning grids, the first four use levels = 10, whereas the latter two use levels = 5. The random forest model further uses trees = 1000 and an mtry range set to c(1, 15), whereas the boosted tree model uses the same mtry range and a learn_rate range set to c(-3, -0.2).
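These grids can be sketched with the dials package; note that learn_rate() interprets its range on the log10 scale:

```r
# Sketch of the tuning grids described above.
library(dials)

# Elastic net: penalty and mixture, 10 levels each
linear_grid <- grid_regular(penalty(), mixture(), levels = 10)

# Boosted tree: mtry, min_n, learn_rate, 5 levels each
bt_grid <- grid_regular(
  mtry(range = c(1, 15)),
  min_n(),
  learn_rate(range = c(-3, -0.2)),   # range given on the log10 scale
  levels = 5
)
```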

    Measured on the most expansive recipe available,24 the processing times for these models—with parallel processing across eight cores where possible—were approximately as follows:25

  • 24 I.e., kitchen sink or feature engineering with interactions only.

  • 25 I simply timed these with my computer’s clock.

    1. Ridge: 5 minutes
    2. Lasso: 4 minutes
    3. Elastic Net: 9 minutes
    4. KNN: 12 minutes
    5. Random Forest: 92 minutes
    6. Boosted Tree: 65 minutes

    6.1 Round 1: Kitchen Sink or Interactions Only

    Below is Table 3, which gives the best mean RMSE of each tuned fit. These results are united in that their underlying recipes allow for the greatest number of predictors possible.26

  • 26 Recall that the kitchen sink recipe produces the greatest number of predictors for the KNN, random forest, and boosted tree models, whereas the feature-engineering with interactions-only recipe produces the greatest number of predictors for the ridge, lasso, and elastic net models. For more, see Section 4.

  • Table 3: Comparison of Best Mean RMSEs by Model (Kitchen Sink or Interactions Only)
    Model Mean RMSE Std. Error
    knn 0.2429732 0.0028678
    rf 0.2869661 0.0018060
    bt 0.4767500 0.0025412
    en 0.5587930 0.0022834
    lasso 0.5589801 0.0022842
    ridge 0.5813393 0.0022810

    The optimal tuning parameters for these models as assessed by mean RMSE are, respectively:

    • KNN: neighbors = 2
    • Random Forest: mtry = 15, min_n = 2
    • Boosted Tree: mtry = 15, min_n = 2, learn_rate = 0.631
    • Elastic Net: penalty = 1e-10, mixture = 0.778
    • Lasso: penalty = 1e-10
    • Ridge: penalty = 1e-10

    Figure 4, below, depicts the confidence intervals for each RMSE estimate:

    Figure 4: Ranking of Best Mean RMSEs with Confidence Intervals, Round 1

    6.2 Round 2: Multicollinearity Averse

    Below is Table 4, which gives the best mean RMSE of each tuned fit. These results are united in that their underlying recipes are multicollinearity averse, and as such draw from a smaller set of predictors.27

  • 27 For a summary of how variables were selected for removal, see Section 4.

  • Table 4: Comparison of Best Mean RMSEs by Model
    Model Mean RMSE Std. Error
    knn 0.2405136 0.0030602
    rf 0.3061221 0.0016808
    bt 0.5048741 0.0025192
    en 0.5734804 0.0024225
    lasso 0.5734929 0.0024186
    ridge 0.5882699 0.0022959

    The optimal tuning parameters for these models as assessed by mean RMSE are, respectively:

    • KNN: neighbors = 2
    • Random Forest: mtry = 15, min_n = 2
    • Boosted Tree: mtry = 15, min_n = 2, learn_rate = 0.631
    • Elastic Net: penalty = 0.000464, mixture = 0.889
    • Lasso: penalty = 0.000464
    • Ridge: penalty = 1e-10

    Figure 5, below, depicts the confidence intervals for each RMSE estimate:

    Figure 5: Ranking of Best Mean RMSEs with Confidence Intervals, Round 2

    6.3 Conclusion

    In each round, the best KNN model clearly outperforms the rest, yet it is with the multicollinearity-averse recipe that the best KNN model exhibits the lowest RMSE.28 As such, we proceed with our final fit by selecting the best KNN model (neighbors = 2) and the multicollinearity-averse recipe.
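The selection-and-refit step can be sketched end-to-end with tidymodels; the toy data, workflow, and tuning grid below are hypothetical stand-ins for the actual KNN pipeline:

```r
# End-to-end sketch: tune a KNN regression, select the best neighbors value
# by RMSE, and refit on the full training set / evaluate on the testing set.
library(tidymodels)

set.seed(301)
x1  <- rnorm(120); x2 <- rnorm(120)
dat <- data.frame(x1 = x1, x2 = x2, y = x1 + 0.5 * x2 + rnorm(120, sd = 0.1))

split <- initial_split(dat, prop = 0.75)
folds <- vfold_cv(training(split), v = 5)

knn_wf <- workflow() |>
  add_recipe(recipe(y ~ ., data = training(split)) |>
               step_normalize(all_predictors())) |>
  add_model(nearest_neighbor(mode = "regression", neighbors = tune()))

knn_res <- tune_grid(knn_wf, folds, grid = grid_regular(neighbors(), levels = 5))

best_knn  <- select_best(knn_res, metric = "rmse")
final_res <- last_fit(finalize_workflow(knn_wf, best_knn), split)
collect_metrics(final_res)   # testing-set RMSE and R^2
```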

  • 28 Compare, in particular, the result in Table 3 and Table 4, respectively.

  • 7 Final Fit

    Table 5, below, gives the performance metrics for the final fit applied to the testing set:

    Table 5: Performance Metrics from Final Fit
    Metric Estimate
    rmse 0.2305683
    mae 0.1225338
    mape 59.8248677
    rsq 0.9741917

    The RMSE is even lower than that achieved on the cross-validation folds (Table 4). The MAE is about half as large as the RMSE; and at approximately 0.974, the \(R^{2}\) statistic is exceedingly high. (The MAPE is about 59.8%, but this is a small figure in absolute terms, the preponderance of HR scores being clustered around 0.)

    The RMSE of approximately 0.23 represents an exceedingly marginal difference. For the full set of HR Scores, the minimum is approximately -3.46 and the maximum approximately 5.34, for a range of approximately 8.8; a band of ±1 RMSE around a prediction,29 then, spans only about 5.24% of the full range. Recalling that HR Scores are themselves estimates with a mean standard deviation of approximately 0.33 lends further credibility to the accuracy of our model.30 Qualitatively, a score difference of 0.23 does not seem to mean much either. Indeed, the examples of countries that experienced such a score change did not witness, to my knowledge, any appreciable changes in their underlying political milieus.31

  • 29 Computed (approximately) by \(0.23*2\).

  • 30 For the code computing this figure, see scripts/6_assess_final_model.R.

  • 31 For these examples, see scripts/6_assess_final_model.R.
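The back-of-the-envelope figure in the text can be reproduced as follows:

```r
# Share of the full HR-scores range covered by a +/- 1 RMSE band
rmse     <- 0.2305683
hr_range <- 5.34 - (-3.46)          # approximately 8.8
share    <- (2 * rmse) / hr_range   # approximately 0.0524, i.e., about 5.24%
```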

  • To round off this discussion is Figure 6, below, which depicts the relationship between our predictions and actual values in the testing set, illustrating the tight fit—and hence high degree of accuracy—of our final model:

    Figure 6: Actual vs. Predicted HR Scores, Testing Set

    8 Time-Series Predictions

    8.1 Overview

    Going beyond the requirements of this assignment, I extend my analysis by not only building time-series models to predict HR scores, but also predicting HR scores with the final fit applied to more recent data. In doing so, I aim to (1) assess whether even more accurate methods of estimating HR scores may exist, and (2) demonstrate that using my final KNN model on more recent data is achievable.32

  • 32 My work for this section can be viewed in scripts/7*.R

  • 8.2 Preprocessing

    Before running any models, I preprocess my 1946-2019 data such that it is suitable for time-series analyses. To do so, I convert the original preprocessed dataset (see Section 3) to a tsibble object, prior to which I rectify missingness with KNN imputation.33 Furthermore, for the final fit, I create a version of the original preprocessed dataset that covers the years 2020-2023, exclusively.

  • 33 This is mainly done to address a handful of cases for population and GDP per capita. The scale of missingness here is infinitesimal, such that imputation variations are highly unlikely to affect the predictions. I considered interpolation as a means of dealing with this missingness; but the predictions, in my estimation, seemed to deviate too greatly from what were the likely values.
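The conversion itself can be sketched with tsibble (column names hypothetical):

```r
# Convert a country-year data frame to a tsibble, keyed by country
# and indexed by year. Data is a hypothetical stand-in.
library(tsibble)

dat <- data.frame(
  country_name = rep(c("A", "B"), each = 3),
  year         = rep(2017:2019, times = 2),
  theta_mean   = rnorm(6)
)

hr_ts <- as_tsibble(dat, key = country_name, index = year)
```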

  • 8.3 Methods

    The methods I use to construct my time series models are generally set forth in the textbook Forecasting: Principles and Practice (Hyndman & Athanasopoulos, 2021), which itself makes heavy use of the fable and tsibble R packages.

    First, I use time series cross-validation to compare univariate ARIMA34 and ETS35 models in predicting HR scores, population, and GDP per capita for the period 2020-2023, for all three are systematically missing therein.36 Next, I select the best model for each time series (i.e., country) and render predictions therefrom. Finally, I apply my final fit to the 2020-2023 data—my population and GDP-per-capita predictions having been appended thereto—and compare the predictions from the pure time-series and KNN-model methods.
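This procedure can be sketched with fable and tsibble; the simulated series and column names below are hypothetical stand-ins:

```r
library(fable)     # also attaches fabletools (ARIMA, ETS, forecast, accuracy)
library(tsibble)

# Hypothetical stand-in data: two countries, 30 years each
set.seed(301)
hr_ts <- as_tsibble(
  data.frame(
    country = rep(c("A", "B"), each = 30),
    year    = rep(1990:2019, times = 2),
    theta   = as.numeric(arima.sim(list(ar = 0.8), n = 60))
  ),
  key = country, index = year
)

# Time-series CV: expanding training windows, then ARIMA vs. ETS per window
cv_fits <- hr_ts |>
  stretch_tsibble(.init = 20, .step = 1) |>
  model(arima = ARIMA(theta), ets = ETS(theta))

fc <- forecast(cv_fits, h = 4)
# Forecasts extending past the observed data cannot be scored and are dropped
accuracy(fc, hr_ts)   # RMSE etc., for choosing per-country winners
```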

  • 34 Autoregressive integrated moving average

  • 35 Exponential smoothing

  • 36 My actual code also predicts GDP, though this is unnecessary given that the variable is removed in the final KNN model. Interestingly, for the V-Dem dataset, population, GDP, and GDP per capita are Fariss variables as well, and they too have not been updated since 2020. For evidence, see the V-Dem Codebook (V14).

  • 8.4 Main Results

    With respect to HR scores, time series cross-validation revealed that ETS was better-performing for 153 countries and ARIMA better for 27 countries. The mean RMSE of these “best” models was approximately 0.188, which compares favorably to the mean RMSEs of the best KNN models (see Table 3 and Table 4).37 HR-score predictions for 2020-2023 were successfully rendered from not only these models, but also the best KNN model, which (as aforementioned) relied on ARIMA and ETS models to impute missing data for population and GDP per capita.38 The two sets of predictions are compared in Figure 7, below:

  • 37 This mean RMSE was computed by (1) taking the mean RMSE for each best model by country, (2) multiplying these means by the proportion of observations each country contributes to the dataset, and (3) summing the products. This is done to better approximate the process whereby tidymodels computes mean RMSEs, which takes the number of rows—not cases—to be \(n\), the final denominator.

  • 38 The remaining missing values were imputed through KNN imputation, as dictated by the underlying recipe.
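The weighting scheme of footnote 37 amounts to a row-weighted mean, sketched here on toy numbers:

```r
# Row-weighted mean RMSE across countries. Numbers are hypothetical.
rmse_by_country <- c(0.15, 0.25, 0.20)
n_obs           <- c(74, 30, 60)   # rows each country contributes

weighted_rmse <- sum(rmse_by_country * n_obs / sum(n_obs))
# equivalently: weighted.mean(rmse_by_country, n_obs)
```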

  • Figure 7: Time Series vs. KNN Predictions, 2020-2023

    Generally, the two methods yield similar predictions, although there appears to be more variability where HR scores are average to below-average. Comparing the distribution of each set of predictions evinces even greater differences between the two:

    Figure 8: Time Series vs. KNN Density Plots

    In Figure 8, above, the distribution of the time-series predictions appears exceedingly similar to the distribution of HR scores observed in the original preprocessed dataset (see Figure 1), being centered at 0 and exhibiting an additional mode below the center. By contrast, the distribution of the KNN predictions is centered higher, at approximately 1.3, which seems to comport with Fariss’s (2014) finding: that human rights respect is improving over time. Together, these observations suggest that the time-series models are animated by a memorization of the original data—especially vis-à-vis the KNN model, which by construction can allow new data and trends to inform its predictions. Put differently, the time-series models may be exhibiting lower bias yet higher variance, whereas the KNN model may be exhibiting the opposite. As such, my preliminary suspicion is that the KNN model will offer more robust predictions than the time-series models will.

    9 Appendices

    9.1 Distribution of Yeo-Johnson-transformed HR scores

    Figure 9: Distribution of Yeo-Johnson-transformed HR Scores (Density & Box Plots)

    The transformation does rectify the right-skew seen in Figure 1, if at the expense of introducing a slight left-skew. It also fails to eliminate the bimodality.